2080.5 - Information Paper: Australian Census Longitudinal Dataset, Methodology and Quality Assessment, 2006-2016
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 20/03/2019   

2. DATA LINKING METHODOLOGY

The data linking process used to create the ACLD included a series of steps which can be generalised into the following:

  • standardisation of data;
  • preparation of data;
  • blocking;
  • record pair comparison; and
  • decision model.


2.1 STANDARDISATION OF DATA

Before records on two datasets are compared, the contents of each need to be as consistent as possible to facilitate comparison. This process is known as 'standardisation' and comprises a number of steps, including verification, recoding and re-formatting variables, and parsing text variables (i.e. separating text variables into their components). Additionally, some variables such as name may require substantial repair prior to standardisation.

Some variables, such as age, differ between the two datasets in a predictable way, and an adjustment is required to account for this variance. Some variables are coded differently at different points in time, and concordances may be necessary to create variables which align on the two datasets. Variables may also be recoded or aggregated in order to obtain a more robust form of the variable. Standardisation takes place in conjunction with a broader evaluation of the dataset, in which potential linking variables are identified.

The standardisation procedure for the ACLD linkage project involved coding imputed and invalid values for selected variables to a missing value. These variables include name, address, day of birth, month of birth, year of birth, age, sex, year of arrival and marital status. Standardisation for hierarchical variables involved collapsing higher levels of aggregation to minimise disagreement when linking records which may have had a small intercensal change or to account for potential differences in the coding of the variable. This allows for records to agree using broader categories rather than disagree on specific information that may have changed over time or be reported and/or coded inconsistently. An example of this is country of birth. Whereas in 2011 the respondent may have been coded to 'Northern Europe' (two digit level of country of birth), in 2016 they may have reported a specific country such as 'England' or 'Norway' (four digit level of country of birth). If left in its original state, a comparison between 'Northern Europe' and 'England' would not agree, even though one is a sub-category of the other. Both two digit and four digit versions of country of birth were used in the linkage, albeit in different passes. This improved the quality of the linkage while also increasing the chances that a link would be made.

The following is a description of specific standardisation techniques that were performed on variables for this project.

First Name and Surname (applies to the 2011 and 2016 Censuses)

In the 2011 Census, name data was first subjected to an automated repair process. Both first names and surnames were compared against corresponding master name indexes, with names being repaired when a suitably close match to a value on the index was found. The name repair process was repeated for the 2016 Census data, with the addition of a number of enhancements. These enhancements optimised both the number and accuracy of names repaired, and included the following:
  • expanded first name index, enabling a larger number of names to be repaired;
  • identification and removal of values that were not considered to be a valid name;
  • use of age and country of birth information to assist with repairing first names;
  • use of family structure to assist with repairing surnames;
  • targeted automatic repair processes based on response type (i.e. online forms vs. paper forms);
  • a manual coding process for paper form responses that could not be sufficiently repaired through automatic means (note that manual repair was performed only for a subset of records in 2011); and
  • retention of additional repaired name options when more than one close match for a name value was found on the index.

After repair, first names were then compared against a nickname concordance, ensuring that different variations would be grouped into a common name for the purposes of linkage. The standardisation of the same name value may also vary depending on the reported sex. For example, a female with the name 'Jess' may be standardised to 'Jessica' whereas it may be standardised to 'Jesse' for a male. Any first names that could not be matched to a nickname retained their original form. An identical nickname concordance was used in both the 2011 and 2016 Censuses, ensuring that the name values from both Censuses were consistently standardised. The nickname concordance process was not performed for surnames.
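
The sex-dependent nickname standardisation described above can be sketched as a lookup keyed on name and sex. This is a minimal illustration only: the 'Jess' entries come from the text, the 'Bill' entry is a hypothetical addition, and the function name and the upper-casing of names are assumptions rather than details of the ABS process.

```python
# Illustrative nickname concordance keyed on (name, sex).
# Only the 'Jess' entries come from the text; 'Bill' is a hypothetical addition.
NICKNAME_CONCORDANCE = {
    ("JESS", "F"): "JESSICA",
    ("JESS", "M"): "JESSE",
    ("BILL", "M"): "WILLIAM",
}

def standardise_first_name(name, sex):
    """Return the common form of a first name, or the cleaned original if no entry exists."""
    key = (name.strip().upper(), sex)
    # Names without a concordance entry keep their (cleaned) original form.
    return NICKNAME_CONCORDANCE.get(key, key[0])

print(standardise_first_name("Jess", "F"))
print(standardise_first_name("Jess", "M"))
print(standardise_first_name("Priya", "F"))
```

Applying the same lookup table to both Censuses is what keeps the standardised values comparable across files.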

After Census names had been repaired and standardised, they were converted into anonymous hash codes to be used in the linking process. These encoded versions of Census name served to assist in further protecting the privacy of Census respondents when linking 2011 and 2016 Census records in the ACLD. No name information was retained from the 2006 Census, so no name information was used in linking 2006 and 2011 Census records.

The hash codes are created by grouping people with a combination of letters from their first and last names using a secure one-way process, meaning that a code cannot be reversed to deduce the original name information. Each code represents approximately 2,000 people drawn from many different letter combinations, and therefore is not unique to an individual. Encoding of 2011 and 2016 Census name information was undertaken during the processing of each respective Census. The encoded name information is retained by the ABS for linking purposes. Original Census name information was not used in linking the ACLD.
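
The paper does not describe the encoding algorithm, so the following is purely an illustration of the general idea of a one-way, many-to-one code, not the ABS method. It assumes a keyed hash (HMAC-SHA256) reduced to a fixed number of groups; the key, the number of groups and the function name are all hypothetical, and the whole name value is hashed here rather than a selected combination of letters.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-secret-key"  # assumption: a key held only by the linking team
N_GROUPS = 10_000                        # assumption: few enough codes that each covers many people

def encode_name(name, key=SECRET_KEY, n_groups=N_GROUPS):
    """Map a standardised name to a non-unique group code using a keyed one-way hash."""
    cleaned = name.strip().upper().encode("utf-8")
    digest = hmac.new(key, cleaned, hashlib.sha256).digest()
    # Reducing the digest modulo n_groups makes many different names share one code.
    return int.from_bytes(digest[:8], "big") % n_groups

print(encode_name("Jessica") == encode_name(" jessica "))  # same name, same code
```

Because many names share each code, agreement on a code is evidence of a match but never proof of identity, which is why the codes are used alongside other linking variables.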

The codes are only accessible to those ABS officers creating the linked dataset, and will never be released outside the ABS.

Geography

As a proportion of the Australian population is expected to change their residential address between Censuses, the ACLD uses geographic data expected to refer to the same time point in the linking process. Usual address from one Census is compared to usual address five years ago from the subsequent Census. Additionally, the following standardisation techniques are applied:
  • imputed address geography is removed for linking purposes, for example a mesh block which was imputed will not be used for linking, though it will remain on the analytical file;
  • usual address from a particular Census was converted to the Australian Statistical Geography Standard (ASGS) that is relevant for the subsequent Census to allow for accurate comparison of Census address data. For example, 2011 Census address was converted to the 2016 ASGS;
  • where a 2006 ASGS unit had been split into multiple 2011 ASGS units, only the 2011 ASGS unit with the most overlap was retained for linkage, while in instances where a 2011 ASGS unit had been split into multiple 2016 ASGS geographic units, up to three options were retained for linkage. For further information on changes to ASGS geography refer to Australian Statistical Geography Standard (ASGS): Volume 1 - Main Structure and Greater Capital City Statistical Areas, July 2011 (cat. no. 1270.0.55.001) for 2011 ASGS or Australian Statistical Geography Standard (ASGS): Volume 1 - Main Structure and Greater Capital City Statistical Areas, July 2016 (cat. no. 1270.0.55.001) for 2016 ASGS; and,
  • to increase the chance of linking on geography and minimise the impact of respondent recall error concerning address five years ago, where a 2011 or 2016 Census respondent reported a different usual address or address one year ago to address five years ago, these addresses were used as alternative options in the linking process when linking to the previous Census.

Personal Characteristics
  • Invalid dates of birth were removed.
  • Imputed instances of sex, marital status and age were removed.
  • Age was increased by 5 years on the initial Census involved in the respective linkage. For example, when linking the 2006 Panel to the 2011 Census, records from 2006 had their age increased by 5. The same was done when linking 2011 records to 2016 records.
  • Country of birth was coded to the two digit level, for example 'Western Europe', to improve the chances of linkage. Four digit country of birth, for example 'Austria' and 'Germany', was also retained to increase the quality of links that agreed at this level. Each level of country of birth was used in different passes.
  • Indigenous status was standardised to group 'Aboriginal', 'Torres Strait Islander' and 'Both Aboriginal and Torres Strait Islander' as a single response.
  • Marital status was standardised to group 'Divorced', 'Separated' and 'Widowed' as a single response. In addition, marital status was coded to missing for persons under 15 years of age.
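
The personal-characteristic rules above can be sketched for a single record from the earlier Census in a linkage. The field names, the grouped category labels and the code value are hypothetical, and deriving the two digit country of birth code from the leading digits of the four digit code is an assumption made for illustration.

```python
GROUPED_MARITAL = {"Divorced", "Separated", "Widowed"}
GROUPED_INDIGENOUS = {
    "Aboriginal",
    "Torres Strait Islander",
    "Both Aboriginal and Torres Strait Islander",
}

def standardise_person(record, years_between=5):
    """Return a copy of a record (dict with hypothetical keys) prepared for linking."""
    out = dict(record)

    # Imputed age is treated as missing; otherwise age is moved forward to the later Census.
    age = None if record.get("age_imputed") else record.get("age")
    out["age"] = None if age is None else age + years_between

    # Keep the four digit code and derive a broader two digit code (assumed prefix rule).
    cob4 = record.get("cob4")
    out["cob2"] = cob4[:2] if cob4 else None

    # Marital status: missing for under 15s, otherwise group the three categories.
    marital = record.get("marital_status")
    if age is not None and age < 15:
        marital = None
    elif marital in GROUPED_MARITAL:
        marital = "Divorced/Separated/Widowed"
    out["marital_status"] = marital

    if record.get("indigenous_status") in GROUPED_INDIGENOUS:
        out["indigenous_status"] = "Aboriginal and/or Torres Strait Islander"
    return out

example = standardise_person({
    "age": 42, "age_imputed": False, "cob4": "2304",  # illustrative code value
    "marital_status": "Widowed", "indigenous_status": "Non-Indigenous",
})
print(example)
```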


2.2 PREPARATION OF DATA

An additional data preparation technique was used for Census records where multiple responses had been provided for key linking variables. A record may have had multiple responses for a single linking variable in the following situations:
  • a name that required repair returned more than one possible repaired name value from the applied process (2016 Census records only);
  • the respondent reported different locations for address of usual residence, address one year ago and address five years ago (only for records in the most recent Census involved in a particular linkage, e.g. for 2016 Census records when linking to 2011 Census records); or,
  • the process of aligning 2011 Census usual address geographies to 2016 ASGS values resulted in more than one possible set of these values for some records.

The process for allowing the use of multiple responses for a linking variable involved restructuring the data for affected records; multiple rows were created, with the number of rows generated equal to the number of different combinations that could be created from the linkage information. This is demonstrated in Tables 1a and 1b below. A respondent with two different encoded first name values and two different mesh blocks would have four permuted rows generated. Meanwhile, the information that only has one stated value (in this example surname and date of birth) is duplicated across all of the generated rows. Structuring the data in this manner allows for all combinations of a respondent's linkage information to be considered in a highly efficient manner. Permutation of name was only used for linkages of 2016 Census records to 2011 Census records as name data was not retained for the 2006 Census. Permutation of data was not used for the original 2006 Panel linkage.

TABLE 1A - EXAMPLE OF DATA RESTRUCTURE, Original Record

Person ID | Encoded First Name 1 | Encoded First Name 2 | Encoded Surname 1 | Encoded Surname 2 | Encoded Mesh Block 1 | Encoded Mesh Block 2 | Date of Birth
1 | 1234 | 5678 | 9876 | - | 12345670000 | 98765430000 | 09/08/2016


TABLE 1B - EXAMPLE OF DATA RESTRUCTURE, Restructured Record

Person ID | Encoded First Name | Encoded Surname | Mesh Block | Date of birth
1 | 1234 | 9876 | 12345670000 | 09/08/2016
1 | 1234 | 9876 | 98765430000 | 09/08/2016
1 | 5678 | 9876 | 12345670000 | 09/08/2016
1 | 5678 | 9876 | 98765430000 | 09/08/2016



2.3 RECORD PAIR COMPARISON

Two linking methods were used to link the ACLD: deterministic and probabilistic. Deterministic linkage methods were initially used to identify matches that had high quality linking information. Probabilistic linking was then used in subsequent passes for the records that had not been linked in the deterministic passes. Probabilistic passes were linked non-sequentially (this method is explained further in Section 2.3.2 Probabilistic Linking).

2.3.1 Deterministic Linking


Deterministic data linkage, also known as rule-based linkage, involves assigning record pairs across two datasets that match exactly or closely on common variables. This type of linkage is most applicable where the records from different sources consistently report sufficient information to efficiently identify links. It is less applicable in instances where there are problems with data quality or where there are limited characteristics.

Initially, a deterministic linkage method was used to identify matches that contained high quality linking information. This involved using selected personal and demographic characteristics (first name hash code, surname hash code, sex, date of birth/age, mesh block and country of birth), to identify the highest quality record pairs that matched exactly on these characteristics. These links were accepted and exempted from the following probabilistic linkage passes. This method identified approximately 75% of links for each of the 2006-2011 Census and 2011-2016 Census linkages with the remaining 25% identified via probabilistic linking.

2.3.2 Probabilistic Linking

Probabilistic linking allows links to be assigned in spite of missing or inconsistent information, providing there is enough agreement on other variables to offset any disagreement. In probabilistic data linkage, records from two datasets are compared and brought together using several variables common to each dataset (Fellegi & Sunter, 1969).

A key feature of the methodology is the ability to handle a variety of linking variables and record comparison methods to produce a single numerical measure of how well two particular records match, referred to as the 'linkage weight'. This allows ranking of all possible links and optimal assignment of the link or non-link status (Solon and Bishop, 2009).

Blocking variables

In probabilistic linkage, record pairs (consisting of one record from each file) can be compared to see whether they are likely to be a match, i.e. belong to the same person. However, if the files are even moderately large, comparing every record on File A with every record on File B is computationally infeasible. Blocking reduces the number of comparisons by only comparing record pairs where matches are likely to be found – namely, records which agree on a set of blocking variables. Blocking variables are selected based on their reliability and discriminatory power. For instance, sex is partially useful as it is typically well reported; however, it is minimally informative as it only divides datasets into two blocks, and therefore does not sufficiently reduce the computational intensity of larger linkages. Accordingly, it is generally not used alone but rather in conjunction with other variables.

Comparing only records that agree on one particular set of blocking variables means a record will not be compared with its match if it has missing, invalid or legitimately different information on a blocking variable. To mitigate this, the linking process is repeated a number of times ('passes'), using a range of different blocking strategies. For example, on the first probabilistic pass of the 2011-2016 linking strategy, a block using a fine level of geography (mesh block) was used to capture the majority of 2011 Census records that had matching information with their corresponding 2016 Census record. The second probabilistic pass blocked on a slightly broader level of geography (SA1), to capture records which disagreed on mesh block, but had matching information at the higher geographic level. The blocking variables used for each pass are outlined in Section 2.4 Blocking and Linking Strategy used in the ACLD.
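
Blocking over several passes can be sketched as follows. The records, field names and the choice of sex as a second blocking variable are hypothetical; the sketch only shows how a pair missed by a fine-grained block (mesh block) can be picked up by a broader one (SA1).

```python
from collections import defaultdict

def candidate_pairs(file_a, file_b, block_keys):
    """Yield (a_id, b_id) for records that agree on every blocking variable."""
    index = defaultdict(list)
    for rec in file_b:
        key = tuple(rec.get(k) for k in block_keys)
        if None not in key:  # records missing a blocking value cannot be compared
            index[key].append(rec["id"])
    for rec in file_a:
        key = tuple(rec.get(k) for k in block_keys)
        if None in key:
            continue
        for b_id in index.get(key, []):
            yield rec["id"], b_id

file_a = [
    {"id": "A1", "mesh_block": "MB1", "sa1": "S1", "sex": "F"},
    {"id": "A2", "mesh_block": "MB2", "sa1": "S1", "sex": "M"},
]
file_b = [
    {"id": "B1", "mesh_block": "MB1", "sa1": "S1", "sex": "F"},
    {"id": "B2", "mesh_block": "MB9", "sa1": "S1", "sex": "M"},
]

# Each pass uses a different set of blocking variables.
passes = {"mesh block": ["mesh_block", "sex"], "SA1": ["sa1", "sex"]}
pairs_by_pass = {
    name: list(candidate_pairs(file_a, file_b, keys)) for name, keys in passes.items()
}
print(pairs_by_pass)
```

Here A2 and B2 disagree on mesh block, so they are only brought together for comparison in the SA1 pass.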

Linking variables

Within a blocking pass, records on the two files which agree on the specified blocking variables are compared on a set of linking variables. Each linking variable has associated field weights, which are calculated prior to comparison. Field weights indicate the amount of information (agreement, disagreement, or missing values) a linking variable provides about whether or not the records belong to the same person (match status). Field weights are based on two probabilities associated with each linking variable: first, the probability that the field values agree given that the two records belong to the same person (match); and second, the probability that the field values agree given the two records belong to different persons (non-match). These are called m and u probabilities (or match and non-match probabilities) and are defined as:

m = P(fields agree | records belong to the same person)
u = P(fields agree | records belong to different people)

Given that the m and u probabilities require knowledge of the true match status of record pairs, they cannot be known exactly, but rather must be estimated. The ABS calculated the m and u probabilities based on a training dataset, under the assumption that each deterministic link on the dataset was a match. The deterministic links used in this phase included (1) the highest quality links accepted in the deterministic linking passes, and (2) additional slightly lower quality links expected to be confirmed in the probabilistic linking phase. This method estimated the likelihood that a record would have a match by taking deaths and net overseas migration into account when estimating the m and u probabilities. This method also generated probabilities for disagreement, which can be referred to as md and ud probabilities:

md = P(fields disagree | records belong to the same person)
ud = P(fields disagree | records belong to different people)

Note that m and u probabilities are calculated separately for each pass, as the probabilities depend upon the characteristics of the pass' blocking variables. For example, the m probability for country of birth when blocking on mesh block will be different to the m probability for country of birth when blocking on sex.
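
A rough sketch of estimating these probabilities from training data is given below. It assumes two hypothetical lists of record pairs within a single pass: pairs taken to be matches (for example, deterministic links) and pairs taken to be non-matches. The adjustment for deaths and net overseas migration mentioned above is not attempted, and the data and function name are illustrative only.

```python
def rate(pairs, field, want_agree=True):
    """Share of pairs where both values are present and agree (or disagree)."""
    hits = 0
    for a, b in pairs:
        if a[field] is None or b[field] is None:
            continue  # missing values count as neither agreement nor disagreement
        if (a[field] == b[field]) == want_agree:
            hits += 1
    return hits / len(pairs)

# Hypothetical training pairs assumed to be matches.
matched_pairs = [
    ({"cob": "23"}, {"cob": "23"}),
    ({"cob": "24"}, {"cob": "24"}),
    ({"cob": "11"}, {"cob": "11"}),
    ({"cob": "11"}, {"cob": "23"}),
]
# Hypothetical pairs assumed to be non-matches.
unmatched_pairs = [
    ({"cob": "11"}, {"cob": "11"}),
    ({"cob": "23"}, {"cob": "24"}),
    ({"cob": "24"}, {"cob": "11"}),
    ({"cob": "23"}, {"cob": "11"}),
]

m = rate(matched_pairs, "cob")                       # P(agree | match)
u = rate(unmatched_pairs, "cob")                     # P(agree | non-match)
md = rate(matched_pairs, "cob", want_agree=False)    # P(disagree | match)
ud = rate(unmatched_pairs, "cob", want_agree=False)  # P(disagree | non-match)
print(m, u, md, ud)
```

Because the pairs available for comparison depend on the blocking variables, repeating this calculation on the pairs from each pass gives pass-specific probabilities.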

Match (m) and non-match (u) probabilities are then converted to agreement and disagreement field weights. They are as follows:

Agree = log2(m/u)
Disagree = log2(md/ud)

These equations give rise to a number of intuitive properties of the Fellegi–Sunter framework (Fellegi & Sunter, 1969). First, in practice, agreement weights are always positive and disagreement weights are always negative. Second, the magnitude of the agreement weight is driven primarily by the likelihood of chance agreement. That is, a low probability of two random people agreeing on a variable (for example, date of birth) will result in a large agreement weight being applied when two records do agree.

The magnitude of the disagreement weight is driven by the stability and reliability of a variable. That is, if a variable is well reported and stable over time (for example, sex) then disagreement on the variable will yield a large negative weight. For each record pair comparison, the field weights from each linking variable are summed to form an overall record pair comparison weight or 'linkage weight'.
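
The weight formulas and the summation into a linkage weight can be sketched as follows. The probabilities are hypothetical. Where md and ud are not supplied, the sketch takes them as 1 - m and 1 - u, which assumes no missing values, and it gives missing comparisons a weight of zero; both are simplifying assumptions rather than the documented ABS treatment.

```python
from math import log2

def field_weights(m, u, md=None, ud=None):
    """Return (agreement weight, disagreement weight) for one linking variable."""
    md = 1 - m if md is None else md  # assumption: no missing values
    ud = 1 - u if ud is None else ud
    return log2(m / u), log2(md / ud)

# Hypothetical (m, u) probabilities for two linking variables.
PROBS = {
    "sex": (0.99, 0.5),
    "date_of_birth": (0.95, 1 / 365),
}
FIELD_WEIGHTS = {field: field_weights(m, u) for field, (m, u) in PROBS.items()}

def linkage_weight(agreement):
    """Sum field weights; `agreement` maps field -> True, False or None (missing)."""
    total = 0.0
    for field, agrees in agreement.items():
        if agrees is None:
            continue  # assumption: a missing comparison contributes nothing
        agree_w, disagree_w = FIELD_WEIGHTS[field]
        total += agree_w if agrees else disagree_w
    return total

for field, (agree_w, disagree_w) in FIELD_WEIGHTS.items():
    print(field, round(agree_w, 2), round(disagree_w, 2))
print(round(linkage_weight({"sex": True, "date_of_birth": True}), 2))
```

With these inputs, agreement on date of birth earns a much larger weight than agreement on sex because chance agreement (u) is far rarer, while disagreement on the well-reported sex variable is heavily penalised.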

Before calculating m and u probabilities for some variables it is first necessary to define what constitutes agreement. Typical comparison functions used in the ACLD linkage include:
  • Exact match (e.g. Sex). Agreement occurs only when the two variable values are identical. This criterion is used for most linking variables; and
  • Numeric difference (e.g. Age). A pair may be defined to agree if their variable values differ by an amount less than or equal to a specified maximum difference.

For further details on comparison functions used for probabilistic linkage, see Christen & Churches (2005).

Near or partial agreement may also be factored into the linking process through calculation of m and u probabilities accounting for such agreement. For example, a person’s age on equivalent records will frequently be an exact match, and the m and u probabilities are calculated based on this definition. During linkage, however, a partial agreement weight was given for age within one year difference to cater for persons who may have understated their age in one Census and/or overstated it in the following Census or vice versa.
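
The comparison functions described above can be sketched as small functions that return an agreement outcome, which is then mapped to a weight. The outcome labels and the weight values are hypothetical; only the one-year tolerance for age comes from the text.

```python
def compare_exact(a, b):
    """Exact match comparison, e.g. for sex."""
    if a is None or b is None:
        return "missing"
    return "agree" if a == b else "disagree"

def compare_age(a, b, partial_tolerance=1):
    """Numeric difference comparison with partial agreement within the tolerance."""
    if a is None or b is None:
        return "missing"
    diff = abs(a - b)
    if diff == 0:
        return "agree"
    if diff <= partial_tolerance:
        return "partial"
    return "disagree"

# Hypothetical weights for each age outcome; the partial weight sits between
# the full agreement and disagreement weights.
AGE_WEIGHTS = {"agree": 4.0, "partial": 1.5, "disagree": -3.0, "missing": 0.0}

# Age on the earlier Census has already been increased by 5 during standardisation.
for earlier_age, later_age in [(35, 35), (35, 36), (35, 40)]:
    outcome = compare_age(earlier_age, later_age)
    print(earlier_age, later_age, outcome, AGE_WEIGHTS[outcome])
```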

Blocking variables, linking variables, comparator types, and m and u probabilities are used as input parameters for the linking software. Records which agree on the blocking variable(s) are compared on all linking variables.


2.4 BLOCKING AND LINKING STRATEGY USED IN THE ACLD

The strategy employed for the re-link of the 2006 Panel to the 2011 Census and linking of both ACLD panels to the 2016 Census builds on the original 2006-2011 ACLD linking strategy, using developments in linking methodology, software and available data to improve the approach. For further details on the original 2006-2011 ACLD linkage see Linkage Results.

To develop the linking strategy to be used for the 2011-2016 ACLD linkage, the 2006 Panel of the ACLD was re-linked to the 2011 Census as part of an investigation into the feasibility of proposed methodological enhancements. While the re-link of the 2006 ACLD Panel sample could not make use of hash encoded name (as the 2011-2016 linkage would benefit from), it was found that improvements could be made on the original linkage with regards to the estimated precision and accuracy of the links achieved. The enhanced linking strategy was then implemented for both the 2006-2011 and 2011-2016 linkages.

The key features of the enhanced linking strategy used include:
  • a combination of deterministic and probabilistic linkage techniques designed to link a high quality dataset that is representative of the Australian population;
  • linking variables found to contain unacceptably low levels of consistent reporting over time, such as highest year of schooling and occupation, were removed from the linking strategy;
  • certain passes were designed to link particular population groups in order to improve linkage rates, such as Aboriginal and Torres Strait Islander peoples, migrants, and children;
  • use of a non-sequential approach to probabilistic linking. The sequential approach used for the original 2006-2011 ACLD linkage removed accepted links after each probabilistic pass, resulting in the successful identification of true matches being dependent on the order of the passes. The non-sequential approach allows for all records to be given an opportunity to link in every probabilistic pass, preventing poorer quality links from earlier passes being accepted where a higher quality link could be found in a later pass; and,
  • blocking weights were applied to each of the probabilistic passes to standardise the linkage weights for all potential record pairs. This allowed all potential links to be comparable across passes to determine the best possible link for each record.

Table 2a displays the original linking strategy used for the 2006-2011 linkage for reference. Tables 2b and 2c display the blocking and linking variables applied in the 2006-2011 (re-link) and 2011-2016 linkages for each pass.

TABLE 2a - BLOCKING AND LINKING VARIABLES, By Pass Number,
2006 Panel, 2006-2011 linkage (original)

PASS NUMBER (a)(b)(c)
1
2
3
4
5
6
7
8
9
10(d)
11
12

PERSONAL INFORMATION

Age
B
B
L
L
L
L
L
L
L
L
L
L
Sex
B
B
B
B
B
B
B
B
L
L
L
L
Day and Month of Birth
B
B
B
B
B
L
L
L
L
L
L
Indigenous status
B
B
B
L
B
L
L
L
L
Country of Birth
L
L
L
L
L
L
L
Year of Arrival
L
L
L
L
L
L
Marital status
L
L
L
L
L
Level of Qualification
L
L
L
L
L
Field of Qualification
L
L
L
L
L
L
L
L
Highest Level of Schooling
L
L
L
L
L
Occupation
L
L
Language spoken at home
L
L
L
L
L
Religion
B
L
L
L
Aged less than 15 block
B
B
B

HOUSEHOLD INFORMATION

Mother's Age
B
L
L
Mother's Day and Month of Birth
B
L
L
Father's Age
L
Father's Day and Month of Birth
L
Family ID block
B

GEOGRAPHICAL INFORMATION

Mesh Block
B
B
B
B
SA1
B
B
SA2
B
B
SA4
B
B
B

(a) Passes 1 and 2 refer to the deterministic linking passes while passes 3-12 refer to the probabilistic linkage passes.
(b) B – blocking variable
(c) L – linking variable
(d) The results of Pass 10 were used to identify the blocking field to be used in Pass 11. As a result, there were no records output from Pass 10.

TABLE 2b - BLOCKING AND LINKING VARIABLES, By Pass Number,
2006 Panel, 2006-2011 linkage (re-link)

PASS NUMBER (a)(b)(c)
1
2
3
4
5
6
7
8

PERSONAL INFORMATION
Age
B
+/- 1
L
L
L
Sex
B
L
L
L
B
B
B
Day and Month of Birth
B
L
L
B
B
B
Year of Birth
B
B
B
B
B
Indigenous status
B
L
L
B
L
L
Country of Birth
B
B
Year of Arrival
L
L
L
L
L
Marital status
Language spoken at home
L
L
L
Religion
L
L
Aged less than 15 block
B

HOUSEHOLD INFORMATION
Mother's Age
L
Mother's Day and Month of Birth
L
Mother's Country of Birth
L
L
Father's Age
L
Father's Day and Month of Birth
L
Father's Country of Birth
L
L

GEOGRAPHICAL INFORMATION
Mesh Block
B
B
SA1
B
SA2
B
SA4
B
B
B


(a) Pass 1 refers to the deterministic linkage pass while passes 2-8 refer to the probabilistic linkage passes.
(b) B – blocking variable
(c) L – linking variable

TABLE 2c - BLOCKING AND LINKING VARIABLES, By Pass Number,
2006 and 2011 Panels, 2011-2016 linkage

PASS NUMBER (a)(b)(c)
1
2
3
4
5
6
7
8
9

PERSONAL INFORMATION
First name hash
B
L
L
L
L
L
L
L
L
Surname hash
B
L
L
L
L
L
L
L
L
Age
B
+/- 1
L
L
L
Sex
B
L
L
B
B
B
B
B
Day and Month of Birth
B
L
L
B
B
B
B
Year of Birth
B
B
B
B
B
Indigenous status
B
L
L
L
L
L
Country of Birth
B
Year of Arrival
L
L
L
L
L
Marital status
Language spoken at home
L
L
L
Religion
L
L
L
Aged less than 15 block
B
B

HOUSEHOLD INFORMATION
Mother's Age
L
Mother's Day and Month of Birth
L
Mother's Country of Birth
L
L
Father's Age
L
Father's Day and Month of Birth
L
Father's Country of Birth
L
L

GEOGRAPHIC INFORMATION
Mesh Block
B
B
SA1
B
SA2
B
B
SA4
B
B
B


(a) Pass 1 refers to the deterministic linkage pass while passes 2-9 refer to the probabilistic linkage passes.
(b) B – blocking variable
(c) L – linking variable

2.5 DECISION MODEL

In deterministic linking, an exact match is required on each of the variables specified in the blocking and linking strategy (see Passes 1 and 2 in Table 2a, and Pass 1 in Tables 2b and 2c). Using this approach, links were only accepted where a unique single record pair was identified. Where a record was included in more than one possible pair, it was returned to the pool of unlinked records for subsequent probabilistic passes.
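
The deterministic rule above (exact agreement on every variable, accepted only when the pair is unique) can be sketched as follows. The field names and records are hypothetical, and the example uses a shortened key for brevity.

```python
from collections import defaultdict

def deterministic_links(file_a, file_b, key_fields):
    """Return (a_id, b_id) pairs that agree exactly on all key fields and are unique."""
    def build_index(records):
        index = defaultdict(list)
        for rec in records:
            key = tuple(rec.get(f) for f in key_fields)
            if None not in key:  # records with missing key values cannot match exactly
                index[key].append(rec["id"])
        return index

    index_a, index_b = build_index(file_a), build_index(file_b)
    return [
        (a_ids[0], index_b[key][0])
        for key, a_ids in index_a.items()
        # Accept only keys held by exactly one record on each file.
        if len(a_ids) == 1 and len(index_b.get(key, [])) == 1
    ]

file_a = [
    {"id": "A1", "name_hash": 1234, "dob": "1980-01-01"},
    {"id": "A2", "name_hash": 5678, "dob": "1990-05-05"},
]
file_b = [
    {"id": "B1", "name_hash": 1234, "dob": "1980-01-01"},
    {"id": "B2", "name_hash": 5678, "dob": "1990-05-05"},
    {"id": "B3", "name_hash": 5678, "dob": "1990-05-05"},
]

links = deterministic_links(file_a, file_b, key_fields=("name_hash", "dob"))
print(links)
```

A2 agrees exactly with both B2 and B3, so it is not linked here and would remain in the pool for the probabilistic passes.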

In probabilistic linking, once potential record pairs are generated and weighted, a decision algorithm determines whether the record pair is linked, not linked or should be considered further as a possible link. The generation of potential record pairs from probabilistic linking can result in the records on one dataset linking to multiple records on the other, resulting in a file of 'many-to-many' potential links. The first phase of the decision process involves assigning a record to its best possible pairing. This process is known as one-to-one assignment. Ideally (and often true in practice) each record has a single, unique best pairing, which is its true match.

In the past, ABS probabilistic linking projects (including the original 2006 Panel linkage) have typically used an auction algorithm to assign links optimally from the pool of all possible links. The auction algorithm maximises the sum of all the record pair comparison weights through alternative assignment choices, such that if a record A1 on File A links well to records B1 and B2 on File B, but record A2 links well to B2 only, the auction algorithm will assign A1 to B1 and A2 to B2, to maximise the overall comparison weights for all record pairs.

For the 2016 ACLD linkage, a change was made to the assignment algorithm. Using the previous example, A1 may still link to B1, but A2 would only be able to link to B2 if it was a better link than A1 to B2. This change ensured that links would only be assigned when they are the absolute best option for both records in the link, which subsequently improved the quality of the links output at this phase. The modified algorithm was also far more efficient than the auction method, with the assignment process completed in a matter of minutes compared to several hours or days when using the auction algorithm. An additional change made for this ACLD linkage was that the one-to-one assignment was run using the combined many-to-many results from all passes in the linkage, rather than running the assignment over the results from each individual pass. This allowed the best links from all passes to be obtained from a single assignment procedure.
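
The difference between the two assignment approaches can be illustrated with the A1, A2, B1, B2 example. The weights are hypothetical. The first function is a brute-force search that finds the same maximum-total assignment an auction algorithm aims for (it is not an auction algorithm itself and only suits tiny inputs). The second is one reading of the modified rule, keeping a pair only when it is the best option for both of its records; ties are not handled.

```python
from itertools import permutations

# Hypothetical linkage weights for the candidate pairs in the example.
PAIR_WEIGHTS = {("A1", "B1"): 12.0, ("A1", "B2"): 11.0, ("A2", "B2"): 10.0}

def max_total_assignment(weights):
    """Brute-force one-to-one assignment that maximises the summed weight."""
    a_ids = sorted({a for a, _ in weights})
    b_ids = sorted({b for _, b in weights})
    slots = b_ids + [None] * len(a_ids)  # None lets a record stay unlinked
    best, best_total = {}, float("-inf")
    for perm in permutations(slots, len(a_ids)):
        pairs = [(a, b) for a, b in zip(a_ids, perm) if b is not None]
        if any(p not in weights for p in pairs):
            continue  # only candidate pairs can be assigned
        total = sum(weights[p] for p in pairs)
        if total > best_total:
            best, best_total = dict(pairs), total
    return best

def mutual_best_assignment(weights):
    """Keep a pair only if it is the highest-weighted option for both of its records."""
    best_for_a, best_for_b = {}, {}
    for (a, b), w in weights.items():
        if a not in best_for_a or w > weights[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or w > weights[(best_for_b[b], b)]:
            best_for_b[b] = a
    return {a: b for a, b in best_for_a.items() if best_for_b[b] == a}

print(max_total_assignment(PAIR_WEIGHTS))    # A1-B1 and A2-B2
print(mutual_best_assignment(PAIR_WEIGHTS))  # A1-B1 only

# If A2-B2 is a better link than A1-B2, A2 can link to B2 under the modified rule.
stronger = dict(PAIR_WEIGHTS)
stronger[("A2", "B2")] = 13.0
print(mutual_best_assignment(stronger))
```

Under the maximum-total approach A2 is assigned to B2 even though B2 links better to A1; under the mutual-best reading that weaker link is not assigned.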

The second phase of the probabilistic decision rule stage takes the output of one-to-one assignment and decides which pairs should be retained as links, and which pairs should be rejected as non-links. The simplest decision rule uses a single 'cut-off' point, where all record pairs with a linkage weight above or at the cut-off are assigned as links, and all those pairs with a linkage weight below the cut-off are assigned as non-links. The best approach to assigning a single cut-off point is to clerically review links; however this process is time and resource intensive.
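
The single cut-off rule, and one simple way a reviewed sample could inform the choice of cut-off, can be sketched as follows. The sample, field names and target value are hypothetical, the sample is treated as if every link had been reviewed, and any sampling design or weighting used in practice is ignored.

```python
def apply_cutoff(links, cutoff):
    """Split one-to-one links into accepted links and rejected non-links."""
    accepted = [link for link in links if link["weight"] >= cutoff]
    rejected = [link for link in links if link["weight"] < cutoff]
    return accepted, rejected

def lowest_cutoff_for_precision(reviewed, target):
    """Lowest weight at which the cumulative precision of reviewed links meets `target`."""
    ordered = sorted(reviewed, key=lambda link: link["weight"], reverse=True)
    true_so_far, best = 0, None
    for n, link in enumerate(ordered, start=1):
        true_so_far += link["is_true"]
        # Cumulative precision: share of true links among all links at or above this weight.
        if true_so_far / n >= target:
            best = link["weight"]
    return best

# Hypothetical reviewed links: linkage weight and clerical review outcome.
sample = [
    {"weight": 30, "is_true": True},
    {"weight": 25, "is_true": True},
    {"weight": 20, "is_true": True},
    {"weight": 15, "is_true": False},
    {"weight": 12, "is_true": True},
    {"weight": 8, "is_true": False},
]

cutoff = lowest_cutoff_for_precision(sample, target=0.8)
accepted, rejected = apply_cutoff(sample, cutoff)
print(cutoff, len(accepted), len(rejected))
```

A higher precision target pushes the cut-off up and reduces the number of links accepted, which is the trade-off between link quality and linkage rate.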


Model-based method

As clerical review was unavailable for the re-link of the 2006 panel to the 2011 Census due to data availability limitations, an alternative method of measuring precision and setting a cut-off was undertaken through the use of models. The method of Chipperfield et al (2018) was applied to provide an independent model-based estimate of the precision. The expected performance of this method was investigated using the 2011-2016 linkage and the results of clerical review undertaken for that linkage. While the clerical estimate of cumulative precision for the 2011-2016 linkage was 98.6%, the model-based approach estimated the precision to be over 99.0%. These results showed that the use of models was a viable option to generate a comparable estimate of precision where clerical review was not available. This model was used as the primary method of calculating precision and setting a cut-off for the 2006-2011 re-link for the 2006 Panel sample. Due to the lack of name information for the 2006 Census, the ability to distinguish a unique link became more difficult. To ensure a high quality linkage while maintaining a high linkage rate, the desired estimated cumulative precision was set at 95%, or an estimated false link rate of approximately 5%. This method achieved a 77.2% linkage rate when linking the 2006 Panel to 2011 Census records.


Clerical Review Method

In order to establish the cut-off point for the original 2006-2011 linkage and the 2011-2016 linkages, a sample of the record pairs was clerically reviewed. This provided the opportunity to ascertain quality levels and enabled an estimate of the number of 'false links', which are links formed that are believed to belong to separate entities (i.e. persons) rather than the same entity.

For the ACLD project, a sample of record pairs was clerically reviewed to set a single cut-off for the set of one-to-one links. Each sampled record pair was manually inspected to resolve its match status (i.e. if the link was 'true' or 'false'). As part of this process, a clerical reviewer was often able to use information which cannot be captured in the automated comparison process, but could be identified by the reviewer, such as common transcription errors (e.g. 1 and 7) or transposed information, such as the day of birth reported as the month or vice versa. This information was only available for the 2011 Census when conducting the original 2006 Panel linkage and for the 2016 Census when conducting the 2011-2016 linkages.

In addition to the linking variables, supplementary information was also used to confirm a link as true. This included:
  • non-linking variables such as ancestry, occupation, schooling and qualification; and,
  • reviewing the dates of birth and country of birth of parents (when available) for child records that had been linked.

These supplementary variables helped to clarify difficult decisions, especially on record pairs belonging to children, allowing for greater insight into whether a record pair was an actual match or just contained similar demographic and personal characteristics for two different individuals.

After completing the sample review, the results were used to set a single cut-off point for the 2011 Census to 2016 Census linkages, designed to assign a high proportion of links with a high level of quality to the final linked dataset. This method achieved final 2011-2016 linkage rates of 80% for the 2006 Panel and 76% for the 2011 Panel.